Variance Estimation Using Refitted Cross-validation in Ultrahigh 

Dimensional Regression 



Jianqing Fan, Shaojun Guo and Ning Hao 
Abstract 

Variance estimation is a fundamental problem in statistical modeling. In ultrahigh dimen- 
sional linear regression where the dimensionality is much larger than sample size, traditional 
variance estimation techniques are not applicable. Recent advances on variable selection in ul- 
trahigh dimensional linear regression make this problem accessible. One of the major problems 
in ultrahigh dimensional regression is the high spurious correlation between the unobserved re- 
alized noise and some of the predictors. As a result, the realized noises are actually predicted 
when extra irrelevant variables are selected, leading to serious underestimate of the noise level. 
In this paper, we propose a two-stage refitted procedure via a data splitting technique, called 
refitted cross-validation (RCV), to attenuate the infiuence of irrelevant variables with high spu- 
rious correlations. Our asymptotic results show that the resulting procedure performs as well 
as the oracle estimator, which knows in advance the mean regression function. The simulation 
studies lend further support to our theoretical claims. The naive two-stage estimator and the 
plug-in one stage estimators using LASSO and SCAD are also studied and compared. Their 
performances can be improved by the proposed RCV method. 

Keywords: Data splitting; Dimension reduction; High dimensionality; Refitted cross-validation; 
Sure Screening; Variance estimation; Variable selection. 

^Jianqing Fan is Frederick L. Moore Professor of Finance, Department of Operational Research & Financial Engi- 
neering, Princeton University, Princeton, NJ 08544, U.S.A. (Email: jqfan@princeton.edu). Shaojun Guo is Assistant 
Professor, Institute of Applied Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of 
Sciences, Beijing 100190, P.R.China (E-mail: guoshaoj@amss.ac.cn). Ning Hao is Visiting Assistant Professor, De- 
partment of Mathematics, the University of Arizona, Tucson, AZ 85721, U.S.A. (Email: nhao@math.arizona.edu). 
The paper was supported by the NTH Grant R01-GM072611 and NSF Grants DMS-0704337 and was completed while 
Shaojun Guo and Ning Hao were a postdoctoral fellow at Princeton University. 
Dedicated to Peter J. Bickel on his occasion of the 70th birthday. 



1 



1 Introduction 



Variance estimation is a fundamental problem in statistical modeling. It is prominently 
featured in the statistical inference on regression coefficients. It is also important for variable 
selection criteria such as AIC and BIG. It provides also a benchmark of forecasting error when 
an oracle actually knows the regression function and such a benchmark is very important for 
forecasters to gauge their forecasting performance relative to the oracle. For conventional 
linear models, the residual variance estimator usually performs well and plays an important 
role in the inferences after model selection and estimation. However, the ordinary least 
squares methods don't work for many contemporary datasets which have more number of 
covariates than the sample size. For example, in disease classification using microarray 
data, the number of arrays is usually in tens, yet tens of thousands of gene expressions are 
potential predictors. When interactions are considered, the dimensionality grows even more 
quickly, e.g. considering possible interactions among thousands of genes or SNPs yields the 
number of parameters in the order of millions. In this paper, we propose and compare several 
methods for variance estimation in the setting of ultrahigh dimensional linear model. A key 
assumption which makes the high dimensional problems solvable is the sparsity condition: 
the number of nonzero components is small compared to the sample size. With sparsity, 
variable selection can identify the subset of important predictors and improve the model 
interpretability and predicability. 



Recently, there have been several important advances in model selection and estimation 
for ultrahigh dimensional problems. The properties of penalized likelihood methods such as 
the LASSO and SCAD have been extensively studied in high and ult rahigh dimensional re - 



gression. Var i ous us ef ul results hav e beeii o 



Zhao and Yu 



(2006); 



Bunea et al. 



(2007); 



jtained. See, for example , 



Zhang and Huand (120081): 



an and Pend (l2004h: 



Meinshausen and Yu 



fcooof ) 



Kim et al. 



(120081); 



Meier et al. 



(120081); 



Lv and Fan! (l2009l 



other important model selection tool is the Dantzig selector proposed by 



Fan and Lv (2009). An 



Candes and Tao 



( 120071 ) which can be easily re c ast as a linear prog r am. 



Bickel et al. 



t is closely related to LASSO, as 
(120091 ). iFan and Lv! (120081 ) showed that correlation ranking 



demonstrated by 

possesses a sure screening property in the Gaussian linear model with Gaussian covariates 
and proposed a sure indep e ndent screening (SIS) and iteratively sure independent screening 



(ISIS) method. 



Fan et al. 



pseudo-likelihood fr amework. 



Fan and Songl (l2010| ) have de- 



extended ISIS to a genera^ 
which includes generalized linear models as a special case 
veloped general conditions under which the marginal regression posses ses a sure sc r eenin g 
property in the context of generalized linear model. For an overview, see lFan and Lvl (l2010(). 
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In all the work mentioned above, the primary focus is the consistency of model selection 
and parameter estimation. The problem of variance estimation in ultrahigh dimensional set- 
ting has hardly been touched. A natural approach to estimate the variance is the following 
two-stage procedure. In the first stage, a model selection tool is applied to select a model 
which, if is not exactly the true model, includes all importa nt variables with m oderate model 



size (smaller than the sample size). In the terminology of iFan and Lvl (120081 ). the selected 
model has a sure screening property. In the second stage, the variance is estimated by or- 
dinary least squares method based on the selected variables in the first stage. Obviously, 
this method works well if we are able to recover exactly the true model in the first stage. 
This is usually hard to achieve in ultrahigh dimensional problems. Yet, sure screening prop- 
erties are much easier to obtain. Unfortunately, this naive two-step approach can seriously 
underestimate the noise level even with the sure screening property in the first stage due 
to spurious correlation inherent in ultrahigh dimensional problems. When the number of 
irrelevant variables is huge, some of these variables have large sample correlations with the 
realized noises. Hence, almost all variable selection procedures will, with high probability, 
select those spurious variables in the model when the model is over fitted, and the realized 
noises are actually predicted by several spurious variables, leading to serious underestimate 
of the residual variance. 

The above phenomenon can be easily illustrated in the simplest model, in which the true 
coefficient /3 = 0. Suppose that one extra variable is selected by a method such as the 
LASSO or SIS in the first stage. Then, the ordinary least squares estimator o"^ is 

n 

^^ = (i-7^)^B^^-^)'- w 

i=l 

where 7^ is the sample correlation of the spurious variable and the response, which is really 
the realized noise in this null model. Most variable selection procedures such as stepwise 
addition, SIS and LASSO will first select the covariate that has highest sample correlation 
with the response, namely, 7„ = maxj<p |cofr„(Xj, Y)\. In ot her words, th i s extr a variable is 



selected to best predict the realized noise vector. However, as lFan and Lvl ( 120081 ) stated, the 
maximum absolute sample correlation 7„ can be very large, which makes o"^ seriously biased. 
To illustrate the point, we simulated 500 data sets with sample size n = 50 and the number 
of covariates p = 10, 100, 1000 and 5000, with {Xj^^j^-^ and noise i.i.d. from the standard 
normal distribution. Figure [U^a) presents the densities of 7^ across the 500 simulations and 
Figure [T](b) depicts the densities of the estimator defined in ([1]). Clearly, the biases of 
become larger as p increases. 



3 




Figure 1: (a) Densities of the maximum absolute sample correlation 7„ for various p. (b) Densities 
of the corresponding estimates cj^ given by ([T]). The vertical line marks the true variance 1. All 
calculations are based on 500 simulations and the sample size n is 50. 

The bias gets larger when more spurious variables are recruited in the model. To illustrate 
the point, let us use the stepwise addition to recruit s variables to the model. Clearly, the 
realized noises are now better predicted, leading to even more severe underestimate of the 
noise level. Figure 2 depicts the distributions of spurious multiple correlation with the 
response (realized noise) and the corresponding naive two-stage estimator of variance for 
s = 1,2,5 and 10, keeping p = 1000 fixed. Clearly, the biases get much larger with s. For 
comparison, we also depict similar distributions based on SIS, which selects s variables that 
are marginally most correlated with the response variable. The results are depicted in Figure 
3 (a). While the biases based on the SIS method are still large, they are smaller than those 
based on the stepwise addition method, as the latter chose the coordinated spurious variables 
to optimize the prediction of the realized noise. 



A similar phenomenon was also obse rved i n classical model selection by [Yg (119981 ). To 



correct the effects of model selection, 



Yd (119981 ) developed a concept of generalized degree of 



freedom (GDF) but it is computationally intensive and can only be applied to some special 
cases. 



To attenuate the influence of spurious variables entered into the selected model and to 
improve the estimation accuracy, we introduce a refitted cross-validation (RCV) technique. 
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Figure 2: (a) Densities of spurious multiple correlation with the response for various number of 
spurious variables s. (b) Densities of the naive two-stage estimators of variance. All calculations 
are based on stepwise addition algorithm with 500 simulations, n = 50, and p = 1000. The vertical 
line marks the true variance 1. 



(a) (b) 




O 0.5 1 1.5 2 O 0.5 1 1.5 2 



Figure 3: (a) Densities of the variance estimators based on the naive two-stage approach with 
number of spurious variables s = 1, 2, 5, and 10. (b) Densities of RCV estimators of variance. All 
calculations are based on 500 simulations using SIS as a model selector and the sample size n is 
50. They show that the biases of the naive two-stage estimator are correctable. The vertical line 
marks the true variance 1. 
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Roughly speaking, we split the data randomly into two halves, do model selection using the 
first half dataset, and refit the model based on the variables selected in the first stage, using 
the second half data, to estimate the variance, and vice versa. The proposed estimator is 
just the average of these two estimators. The results of the RCV variance estimators with 
s = 1,2,5 and 10 are presented in Figure 3(b). The corrections of biases due to spurious 
correlation are dramatic. The essential difference of this approach and the naive two-stage 
approach is that the regression coefficients in the first stage are discarded and refitted using 
the second half data and hence the spurious correlations in the first stage are significantly 
reduced at the second stage. The variance estimation is unbiased as long as the selected 
models in the first stage contain all relevant variables, namely, possess a sure screening 
property. It turns out that this simple RCV method improves dramatically the performance 
of the naive two-stage procedure. Clearly, the RCV can also be used to do model selection 
itself, reducing the influence of spurious variables. 

To appreciate why, suppose a predictor has a big sample correlation with the response 
(realized noise in the null model) over the first half dataset and is selected into the model 
by a model selection procedure. Since the two halves of the dataset are independent and 
the chance that a given predictor is highly correlated with realized noise is small, it is very 
unlikely that this predictor has a large sample correlation with the realized noise over the 
second half of the dataset. Hence, its impact on the variance estimation is very small when 
refitted and estimating the variance over the second half will not cause any bias. The above 
argument is also true for the non-null models provided that the selected model includes all 
important variables. 

To gain better understanding of the RCV approach, we compare our method with the 
direct plug-in method, which comput e s the residual variance based on a regularized fit. This 



is inspired by iGreenshtein and Ritovi (120041 ) on the persistence of the LASSO estimator. An 
interpretation of their results is that such an estimator is consistent. However, there is a 
bias term of order 0{slogp/n) inherent in the LASSO-based estimator, when the regular- 
ization parameter is optimally tuned. When the bias is negligible, the LASSO based plug-in 
estimator is consistent. The plug- in variance estimation based on the general folded-concave 
penahzed least squares estimators such as SCAD are also discussed. In some cases, this 
method is comparable with the RCV approach. 

The paper is organized as following. Section [2] gives some additional insights into the 
challenges of high dimensionality in variance estimation. In Section [HI the RCV variance 
estimator is proposed and its sampling properties are estabhshed. Section H] studies the 
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variance estimation based penalized likelihood methods. Extensive simulation studies are 
conducted in Section [5] to illustrate the advantage of the proposed methodology. Section [6] 
is devoted to a discussion and the detailed proofs are provided in the Appendix. 

2 Insights into challenges of High Dimensionality in variance es- 
timation 

Consider the usual linear model 

= xf/3 + £„ or y = X/3 + £, (2) 

where y = (Yi, is an n-vector of responses, X = (xi,...,x„)^ is an n x p matrix 
of independent and identically distributed {i.i.d.) variables xi,..., x„, f3 = /3p)^ is a 

p- vector of parameters and e = {ei, ...,e„)"^ is an n-vector of i.i.d. random noises with mean 
and variance o"^. We always assume the noise is independent of predictors. For any index 
set M C {l,2,...,p}, (3j^ denotes the sub-vector containing the components of the vector 
f3 that are indexed by M, Xm denotes the sub- matrix containing the columns of X that 
are indexed by M, and Pm = Xm(X^^Xa/)~^XJ^ is the projection operator onto the linear 
space generated by the column vectors of Xa/. 

When p > or p n, it is often assumed that the true model Mq = {j : (3j ^ 0} is sparse, 
i.e. the number of non-zero coefficients s = \Mq\ is small. It is usually assumed that s is fixed 
or diverging at a mild rate. Under various sparsity assumptions and regularity conditions, 
the most popular variable selection tools such as LASSO, SCAD, adaptive LASSO, SIS 
and Dantzig selector possess various good properties regarding model selection consistency. 
Among these properties are the sure screening property, model consistency, sign consistency, 
weak oracle property and oracle property, from weak to strong. Theoretically, under some 
regularity conditions, all aforementioned model selection tools can achieve model consistency. 
In other words, they can exactly pick out the true sparse model with probability tending 
to one. However, in practice, these conditions are impossible to check and hard to meet. 
Hence, it is often very difficult to extract the exact subset of significant variables among a 
huge set of covariates. One of reasons is the spurious correlation, as we now illustrate. 

Suppose that unknown to us the true data generating process in model ([2]) is 

Y = 2Xi + 0.3X2 + e 



7 



where Xj is the n- dimensional vector of the reahzations of the covariate Xj. Furthermore, 
let us assume that {XjYj=i and e follow independently the standard normal distribution. As 
illustrated in Figure 1(a), where p is large, there are realizations of variables that have high 
correlations with e. Let us say cofr(X9, e) = 0.5. Then, Xg can even have a better chance 
to be selected than X2. Here and hereafter, we refer the spurious variables to those variables 
selected to predict the realized noise e and their associated sample correlations are called 
spurious correlations. 

Continued with the above example, the naive-two stage estimator will work well when 
the model selection is consistent. Since we may not get model consistency in practice and 
have no way to check even if we get it by chance, it is natural to ask whether the naive 
two-stage strategy works if only sure screening can be achieved in the first stage. In the 
aforementioned example, let us say a model selector chooses the set {Xi, X2, Xg}, which 
contains all true variables. However, in the naive two-stage fitting, Xg is used to predict s, 
resulting in substantial underestimate of o"^ = var(e). Upon both variables Xi and X2 are 
selected, all spurious variables are recruited to predict e. The more spurious variables are 
selected, the better e is predicted, the more serious underestimation of cr^ by the naive two 
stage estimation. 

We say a model selection procedure satisfies sure screening property if the selected model 
M with model size s includes the true model Mq with probability tending to one. Explicitly, 

P{M D Mo) -^1 as n^oc. 



The sure screening property is a crucial criterion when evaluating a model selection pro- 
cedure for high or ultrahigh dimensional problems. Among all model consistent properties, 
the sure screening property is the weakest one and the easiest to achieve in practice. 

Let us demonstrate the naive two-stage procedure in detail. Assume that the selected 
model M in the first stage includes the true model Mq. The ordinary least squares estimator 
0"^^ at the second stage, using only the selected variables in M, is 

2 y^(I. - P,^)y e^jln - Pm)s 

= ' — = ' — ' y^^) 

^" n — s n — s 

where is the n x n identity matrix. How does this estimator perform? To facilitate the 
notation, denote the naive estimator by a^. Then, the estimator can be written as 
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where 7^ = P ^^eje^ e. Let us analyze the asymptotic behavior of this naive two-stage 
estimator. 

Theorem 1 Under the assumptions (A1)-(A2) together with (A3)-(A4) or (A5)-(A6) in the 
Appendix, we have 

1. If a procedure satisfies the sure screening property with s < bn where bn = o{n) is 
given in Assumption (A2), then o'^l{\ — 7^) converges to in probability as n 00. 
Furthermore, 

where )■ ' stands for 'convergence in distribution'. 

2. If, in addition, logp/n = 0(1), then % = Op{\J s \ogp/n) . 

It is perhaps worthwhile to make a remark about Theorem [TJ 7^ plays an important 
role on the performance of a^. It represents the fraction of bias in a^. The slower 7„ 
converges to zero, the worse performs. Moreover, if 7^ converges to a positive constant 
with a non-negligible probability, it will lead to an inconstant estimator. The estimator can 
not be root-n consistent if s\ogp/y/n — )■ 00. This explains the poor performance of a^, as 
demonstrated in Figures [2] and [HI While Theorem [T] gives an upper bound of 7„, it is often 
sharp. For instance, if {Xj}^^-^ and e are i.i.d. standard normal distribution and s = 1, then 
7„ is just the maximum absolute sample correlation between e and {Xj}j^i. Denote the jth 
sample correlation by %j = cofr„(Xj,e), j = 1, ■ ■ ■ ,p. Applying the transformation T(r) = 
r/\/l — r^, we get a sequence {^nj = \/n — 'lT{^nj)Yj=\ with i.i.d. Student's t distribution 
with n — 2 degrees of freedom. Simple analysis on the extreme statistics of the sequences 
{inj} and {7„j} shows that for any c > such that log(p/c) < n + 2, we have 

P|7n > Vlog(p/c)/(2n)} > 1 - exp(-c), (4) 

which implies the sharpness of Theorem 1 in this specific case. Furthermore, when logp = 

% = ^J2\ogp/n{l + Op{l)} 
with the limiting distribution is given by 

p|a/2 \og2p[^^ri - d2p) < — > exp{-exp(-x)}. (5) 

^j^g^g d,=V2W^-^V^ jL]2Mll_ Appendix A.5 for details. 
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3 Variance estimation based on refitted cross-validation 



3.1 Refitted cross-validation 



In this section, we introduce the refitted cross-vahdation method to remove the infiuence of 
spurious variables in the second stage. The method requires only that the model selection 
procedure in stage one has a sure screening property. The idea is as follows. We assume the 
sample size n is even for simplicity and split randomly the sample into two groups. In the 
first stage, an ultrahigh dimensional variable selection method like SIS is applied to these 
two datasets separately, which yields two small sets of selected variables. In the second 
stage, the ordinary least squares (OLS) method is used to re-estimate the coefficient (3 and 
variance a^. Different from the naive two-stage method, we apply again OLS to the first 
subset of the data with the variables selected by the second subset of the data and vice 
versa. Taking the average of these two estimators, we get our estimator of cr^. The refitting 
in the second stage is fundamental to reduce the infiuence of the spurious variables in the 
first stage of variable selection. 

To implement the above idea of the refitted cross-validation, consider a dataset with 
sample size n, which is randomly split to two even datasets (y^^^X^^^) and (y^^^X^^)). 
First, a variable selection tool is performed on (y^^^X^^^) and let Mi denote the set of 
selected variables. The variance cr^ is then estimated on the second dataset (y^^^X^-*), 
namely, 

^^_(y(2)f(V-Pg)y(2) 



n/2 - \Mi\ 



where Pg = Xg(xg^x5^|)-ix2f ■ Similarly, we use the dataset two (y(2),X(2)) to 



select the set of important variables M2 and the first dataset (y'^^^X^-') for estimation of 
(7^, resulting in 



n/2-\M2\ 

We define the final estimator as 

4cv = (^i' + ^2)/2. (6) 
An alternative is the weighted average defined by 
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When |Mi| = IM 



2h 



we have a^cv 



"^WRCV- 



In the above procedure, although Mi includes some extra unimportant variables besides 
the important ones, these extra variables will play minor roles when we estimate cx^ using 
the second dataset along with refitting since they are just some random unrelated variables 
over the second dataset. Furthermore, even when some important variables are missed in 
the first stage of model selection, they have a good chance being well approximated by the 
other variables selected in the first stage to reduce modeling biases. Thanks to the refitting 
in the second stage, the best linear approximation of those selected variables is used to 
reduce the biases. Therefore, a larger selected model size gives us not only a better chance 
of sure screening, but also a way to reduce modeling biases in the second stage when some 
important variables are missing. This explains why the RCV method is relatively insensitive 
to the selected model size, demonstrated in Figures 3 and 6 below. With a larger model 
being selected in the stage one, we may lose some degrees of freedom and hence get an 
estimator with slightly larger variance than the oracle one at finite sample. Nevertheless, 
the RCV estimator performs well in practice and asymptotically optimal when s = o{n). The 
following theorem gi ves the prop e rty o 



the RCV estimator. It requires only a sure screening 
property, studied by lFan and Lvl (120081) for normal mu ltiple regression. iFan and Song] (120101 ) 



for generalized linear models, and IZhao and Lil (120101 ) for Cox regression model. 



Theorem 2 Assume the regularity conditions (Al) and (A2) hold and Ele'^] < oo. // a 
procedure satisfies the sure screening property with si < hn and S2 < fen; then 

V^{alcv-cy')^N{Q,E[e']-a'). (8) 



Theorem [2] reveals that the RCV estimator of variance has an oracle property. If the 
regression coefficient /3* is known by oracle, then we can compute the realized noise Ei = 
Yi — xf /3* and get the oracle estimator 

n 

al = n-'Y.^Y,-^Jn'- (9) 

i=l 

This oracle estimator has the same asymptotic variance as a^cv- 

There are two natural extensions of the aforementioned refitted cross-validation tech- 
niques. 

K-fold data splitting: The first natural extension is to use K-fold data splitting tech- 
nique rather than two-fold spliting. We can divide the data into K groups, select the model 
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with all groups except one, which is used to estimate the variance with refitting. We may 
improve the sure screening probability with this K-fold method since there are now more 
data in the first stage. However, there are only n/K data points on the second stage for 
refitting. This means that the number of variables selected in the first stage should be much 
less than n/K. This makes the ability of sure screening hard in the first stage. For this 
reason, we work only on the two-fold refitted cross-validation. 

Repeated data splitting: There are many ways to randomly split the data. Hence, 
many RCV variance estimators can be obtained. We may take the average of the resulting 
estimators. This reduces the influence of the randomness in the data splitting. 

Remark 1. The RCV procedure provides an efficient method for variance estimation. 
Those technical conditions in Theorem 2 may not be weakest possible. They are imposed to 
facilitate the proofs. In particular, we assume that -P|0min(&n) > Ao| = 1 for all n, which 
implies that the selected variables in stage one are not highly correlated. Other methods 
beyond least squares can be applied in the refitted stage when those assumptions are possibly 
violated in practice. For instance, if some selected variables in stage one are highly correlated 
or the selected model size is relatively large, ridge regression or penalization methods can 
be applied in the refitted stage. Moreover, if the density of the error e seems heavy-tailed, 
some classical robust methods can also be employed. 

Remark 2. The paper focuses on variance estimation under the exact sparsity as- 
sumption and sure screening property. It is possible to extend our results to nearly sparse 
cases. For example, the parameter (3 is not sparse but satisfies some decay condition such as 
'Ylik — ^ ^'^^ some positive constant C. In this case, we do not have to worry too much 
whether a model selection procedure can recover small parameters. In this case, so long as a 
model selection method can pick up a majority of all variables with large coefficients in the 
first stage, we would expect that the RCV estimator performs well. 

3.2 Applications 

Many statistical problems require the knowledge of the residual variance, especially for high 
or ultra-high dimensional linear regression. Here we brief a couple of applications. 

(a) Constructing confidence intervals for coefficients. A natural application is to use 
estimated (Trcv to construct confidence intervals for non- vanishing estimated coe fficients. For 
example, it is well known that the SCAD estimator possesses an oracle property (jFan and Li . 
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2001 : Fan and Lv . 2009 ). Let f3]^;j be the SCAD estimator, with corresponding design matrix 



XjQ^. Then, for each j G M, 1 — a confidence interval for /3j is 



/3j ± Zi_a/2Cjd'iiCYj 



(10) 



in which Cj is the diagonal element of the matrix (X^^X^;^)"^ that corresponds to the j*'* 
variable. Our simulation studies show that such a confidence interval is accurate and has a 
similar performance to the case where a is known. 

The confidence intervals can also be constructed based on the raw materials in the refitted 
cross validation. For example, for each element in M = Mi fl we can take the average 
of the refitted coefficients as the estimate of the regression coefficients in the set M, and 
(Si+S2)(3"^cv/4 as the corresponding estimated covariance matrix, where Si = (X.^^^'^X.^^^)^^ 
is computed based on the first half of the data at the refitting stage and S2 = (X^-'-^X^^)^^ 
is computed based on the second half of the data. In addition, some 'cleaning' techn iques 
through p- values ca n be al so applied here. In particular, IWasserman and Roederl (120091 ) and 



Meinshausen. et al. 



(|2009[ ) studied these techniques to reduce the number of falsely selected 



variables substantially. 

(b) Genomewide association studies. Let Xj be the coding of the jth Single Nucleotide 
Polymorphism (SNP) and Y be the observed phenotype (e.g. height or blood pressure) or 
the expression of a gene of interest. In such a quantitative trait loci (QTL) or eQTL study, 
one frequently fits the marginal linear regression 



EiY\X,) = a, + P,X, 



:ii) 



based on a sample of size n individuals, resulting in the marginal least-squares estimate Pj. 
The interest is to test simultaneously the hypotheses Hqj : (3j = {j = 1, ■ ■ ■ ,p). If the 



conditional 



distributi o n of Y given Xi, 



be shown ( Han et all I2OIO ) that 



■ ,Xp is N{fi{Xi,--- , 
4)^~iV((/3i,...,/3, 



Xp),a 



"^^ then it can easily 
^,a'^S/n), where the 



element of S is the sample covariance matrix of X^ and Xj divided by their sample variances. 
With 0"^ estimated by the RCV, the P-value for testing individual hypothesis Hqj can be 
computed. In addition, the dependence of the least-squares estimates is now known and 



hence the false discovery proportion or rate can be estimated and controlled (IHan et al. 
2OI0I). 



(c) Model selection. Popular penalized approaches for variable selection such as LASSO, 
SCAD, adaptive LASSO and elastic-net often involve the choice of tuning or regularization 
parameter. A proper tuning parameter can improve the efficiency and accuracy for variable 
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selection. Several criteria, such as Mallaw's Cp, AIC and BIC, are constructed to choose 
tuning parameters. All these criteria rely heavily on a common parameter, the err or variance. 



As an illustration, consider estimating the tuning parameter of LASSO (See also lZou. et al. 



( 120071 )). Let A be the tuning parameter with the fitted value (ly^ = Then AIC and 

BIC for the LASSO are written as 

A1C(Aa,^') = "^L^'"' + -4f(AA) 

and 



na^ n 

It is easily seen that the variance has an important impact on both AIC and BIC. 



4 Folded-concave Penalized least squares 

In this section, we discuss some related methods on variance estimation and their corre- 
sponding asymptotic properties. The oracle estimator of cr^ is 



n 



1=1 



A natural cand idate to estimate the varianc e is R0), where ^ is the LASSO or SCAD 
estimator of f3* . iGreenshtein and Ritovl ( 120041 ) showed the persistent property for the LASSO 
estimator Their result, interpreted in the linear regression setting, implies R0i) — > 
i?(/3*) = 0"^ in probability, where R{(3) = E(Y — ^f3y. In fact, it is easy to see that their 
result implies 

R0^) -> = Rif3*). 
In other words, R0l) is a consistent estimator for the variance. 



Recall the LASSO estimator is defined as 



1 " 

= argmin - V (F, - xf /3)' + A„||/3||i. 
/3 " ^ 



(12) 



2=1 



consistent, 



Greenshtein and Ritovl ( 20041 ) suggested A„ = o{{n/ \ogp)2} 



To make R{$l] 
asymptotically. 

is chosen by cross-validation. Therefore, we define the LASSO variance estimator o"| by 



Wasserman and Roederl (120091 ) showed the consistency still holds when A„ 



1 " 

n - Si ^ V 

1=1 



(13) 
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where sl = #{j : ^ 0}. 

We shall see that a1 usually underestimates the variance due to spurious correlation, 
as the LAS S O sha res a similar spirit of the stepwise addition (see the LARS algorithm by 



Efron et al 



(120041 ) )■ Thus, we also consider the leave-one-out LASSO variance estimator 

1 " 9 



n 

i=l 



where is the LASSO estimator using all samples except the ith. one. In practice, i^-fold 
{K equals 5 or 10) cross- validated LASSO estimator is often used and shares the same spirit 
as dH]). We divide the dataset into K parts, say Vi, ...,Vk and define 



f^cvL = mm 



A 

k=l ii^Vk 



where (3^ is the LASSO estimator using all data except ones in with tuning parame- 
ter A. This estimator differs from the plug-in method (fT3!) in that multiple estimates from 
training samples are used to compute residuals from the testing samples. We will see that 
the estimator cr^y-^ is typically closer to R0l) than R0j^), but it usually somewhat overes- 
timates the true variance from our simulation experience. The following theorem shows the 
convergence rate for the LASSO estimator. 

Theorem 3 Suppose the assumptions (Al) - (A4) and (Al) hold. If the true model size 
s = o(n"") for some < 1; then, we have 

a\ — a"^ = Op(max(?7,"^/^, s logp/n)). 

// s logp/ ^/n — 7- 0, we have 

^i&l - a^) N{0, E[e^] - a^). 



The factor slogp/n reflects the bias of the penalized Li-estimator. It can be non- 
negligible. When it is negligible, the plug-in LASSO estimator possesses also the oracle 
property. In general, it is difficult to study the asymptotic distribution of the LASSO esti- 
mator when the bias is not negligib le. In particular , we c an not obtain the standard error 
for the estimator. Even for finite p, iKnight and Ful (120001 ) investigated the asymptotic dis- 
tribution of LASSO-type estimators but it is too complicated to be applied for inference. 
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To tackle this difficulty, iPark and Casellal (|2008[ ) and iKyung. et al\ (120101 ) used the hier- 
archical Bayesian formulatio n to produce a valid standard error for LASSO estimator, and 
Chatterjee and Lahiril ( 120101 ) proposed a modified bootstrap method to approximate the dis- 
tribution of LASSO estimator. But it is unclear yet whether or not their methods can be 
applied to high or ultra-high dimensional setting. 



Recently, iFan and Lvl ( 120091 ) studied the oracle properties of non-concave penalized likeli- 
hood method in the ultrahigh dimensional setting. Inspired by their resul ts, the variance 
can be consistently and efficiently estimated. The SCAD penalty p\{t) (IFan and Lil . l200ll ) 
is the function whose derivative is given by 

p',(t) = \{l{t < A) + > A)},t > 0,a > 2, 

(a — 1)A 



where a = 3.7 is often used. Denote by 



Qn,A„(/3) = lly - X/3f + 2n^p,„(|/3,|), 



(16) 



and let 0^qp^-q be a local minimizer of {f3) with respect to j3. Thus, the variance cr^ 
can be estimated by 



0". 



SCAD 



n 



1 



i=l 



where s = #{j : (/3scAD)i 7^ 0}. 



The following theorem shows the oracle property and convergence rate for the SCAD 
estimator. 

Theorem 4 Assume \ogp = 0{n"°) and the true model size s = 0{n""), where qq G 
[0,1). Suppose that the assumptions (Al), (A3)-(A4) (or (A5)-(A6)) and (A8)-(A9) in the 
Appendix are satisfied. Then, 



1. (Model Consistency) There exists a strictly local minimizer /3„ 
Qn,A„(/3) such that 

{j ■ 4- ^ 0} = Mo 

with probability tending to one; 

2. (Asymptotic Normality) With this estimator (3^, we have 



0u---Jpf of 



16 



Theorem H] reveals that, if A„ is chosen reasonably, o'^qad works as well as the RCV 
estimator (t^qy and better than a1. However, it is hard to achieve this oracle property 
sometimes. 

5 Numerical Results 
5.1 Simulation Study 

In this section, we illustrate and compare the finite sample performance of the methods 
described in the last three sections. We applied these methods to three examples: the null 
model and two sparse models. The null model (Example 1) is given by 

Y = x^O + e, e~A^(0,l) (17) 

where Xi, X2, ■ ■ ■ , Xp are i.i.d. random variables, following the standard Gaussian distri- 
bution. This is the sparsest possible model. The second sparse model (Example 2) is given 
by 

Y = b{Xi+X2 + X^)+e, 6^N{0,1) (18) 

with different b representing different levels of signal-to-noise ratio (SNR). The covariates 
associated with model f|T8|) are jointly normal with equal correlation p, and marginally 
iV(0,l). 

The third sparse model (Example 3) is more challenging, with 10 nontrivial coefficients, 
{(3j I j = 1, 2, 3, 5, 7, 11, 13, 17, 19, 23}. The covariates are jointly normal with cov(Xj, Xj) = 
0.51^-^1 The nonzero coefficients vector is 

b ■ (1.01, -0.06, 0.72, 1.55, 2.32, -0.36, 3.75, -2.04, -0.13, 0.61) 

where b varies to fit different SNR levels. The random error follows the standard normal 
distribution. 

In each of these settings, we test the following four methods to estimate the variance. 

Method 1: Oracle estimator (Q, which is not a feasible estimator whose performance 
provides a benchmark. 

Method 2: Naive two-stage method, denoted by N-SIS, if SIS is employed in the model 
selection step. 
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Method 3: Refitted cross-validation variance estimator ([6]), denoted by RCV. 



Method 4: One step method via penalized least squares estimators. We introduced this 
method in Section [Hand recommended two formulas to estimate the variance, direct plug-in 
method (P) like formula ( ITSll and cross-validation method (CV) like formula ( JTSll . 

In methods 2-4, we employed (I)SIS, SCAD, LASSO as our model selection tools. For 
SCAD and LASSO, the tuning parameters were chosen by 5-fold or 10-fold cross-validation. 
For (I) SIS, the predetermined model size is always taken to be 5 in the n ull model and n/4 in 
the sparse model, unless specified explicitly. The principled method of IZhao and Lil (120101 ) 
can be employed to automatically choose the model size. 

Example 1. Assume the response Y is independent of all predictors Xj's, which follow 
i.i.d. standard Gaussian distribution. We consider the cases when numbers of covariates vary 
from 10, 100 to 1000 and the sample sizes equal 50, 100 and 200. The simulation results are 
based on 100 replications and summarized in Table 1. In Figure HI three boxplots are listed 
to compare the performance of different methods for the case n = 50, 100, 200 and p = 1000. 
From the simulation results, we can see the improved two-stage estimators (RCV-SIS and 
RCV-LASSO) are comparable with the oracle estimator and much better than the naive 
ones, especially in the case when p ^ n. This coincides with our theoretical result. RCV 
improves dramatically the naive (natural) method, no matter SIS or LASSO is used. 



n = 50 



n = 100 



n = 200 




Oracle N-SIS N-LASSO RCV-SIS RCV-LASSO 



Oracle N-SIS N-LASSO RCV-SIS RCV-LASSO 



Oracle N-SIS N-USSO RCV-SIS RCV-LASSO 



Figure 4: Boxplots of when data are generated from the null model (llTh with p = 1000 and 
n = 50, 100 and 200. The number of simulation is 100. The horizontal line marks the true variance 
1. 

Example 2. We now consider the the model f|T8|) with {n,p) = (200, 2000), p = and 0.5. 
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Table 1: Simulation Results for Example 1: The bias (BIAS), Standard Error (SE) and Average 
Model Size (AMS) for oracle, naive and RCV two-stage procedures are reported below. 



p = io 





n = 50 


n = 100 


n = 200 


Method 


BIAS 


SE 


AMS 


BIAS 


SE 


AMS 


BIAS 


SE 


AMS 


Oracle 


0.006 


0.220 





-0.023 


0.144 





-0.015 


0.109 





N-SIS 


-0.072 


0.209 


5 


-0.064 


0.142 


5 


-0.030 


0.109 


5 


RCV-SIS 


0.017 


0.234 


5 


-0.029 


0.150 


5 


-0.013 


0.114 


5 


N-LASSO 


-0.052 


0.211 


1.08 


-0.051 


0.148 


1.01 


-0.028 


0.108 


0.94 


RCV-LASSO 


-0.003 


0.219 


1.41 


-0.026 


0.149 


1.24 


-0.015 


0.110 


1.02 


p= 100 




n = 50 


n = 100 


n = 200 


Method 


BIAS 


SE 


AMS 


BIAS 


SE 


AMS 


BIAS 


SE 


AMS 


Oracle 


-0.011 


0.205 





0.023 


0.154 





-0.010 


0.154 





N-SIS 


-0.325 


0.151 


5 


-0.164 


0.135 


5 


-0.112 


0.135 


5 


RCV-SIS 


-0.004 


0.216 


5 


0.018 


0.165 


5 


-0.009 


0.165 


5 


N-LASSO 


-0.272 


0.319 


5.90 


-0.153 


0.279 


13.56 


-0.073 


0.279 


3.16 


RCV-LASSO 


0.032 


0.359 


4.67 


0.022 


0.171 


5.89 


-0.010 


0.171 


12.41 


p = iOOO 




11 = 50 


n = 100 


11 = 200 


Method 


BIAS 


SE 


AMS 


BIAS 


SE 


AMS 


BIAS 


SE 


AMS 


Oracle 


-0.011 


0.176 





-0.015 


0.130 





-0.015 


0.095 





N-SIS 


-0.488 


0.118 


5 


-0.314 


0.098 


5 


-0.192 


0.079 


5 


RCV-SIS 


-0.017 


0.211 


5 


-0.018 


0.144 


5 


-0.012 


0.098 


5 


N-LASSO 


-0.351 


0.399 


7.47 


-0.256 


0.330 


9.37 


-0.196 


0.251 


9.90 


RC\"-LASSC) 


-0.029 


0.2()() 


5.0:-! 


-0.022 


().i<S() 


8.27 


-0.011 


O.IO:-! 


8.79 
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Moreover, we consider three values of coefficients 6 = 2,6=1 and 6 = 1/ -\/3, corresponding 
to different levels of SNR vT2, a/3 and 1 for each case when p = 0. The results depicted 
in Table 2 are based on 100 replications (The results for 6=1 are presented in Figure [5] 
and are omitted from the table). The boxplots of all estimators for the case p = 0.5 and 
6=1 are shown in Figure [51 They indicate that the RCV methods behave as well as oracle, 
and much better than naive two-stage methods. Furthermore, the performance of the naive 
two-stage method depends highly on the model selection technique. The one-step methods 
perform also well, especially P-SCAD and CV-SCAD. P-LASSO and CV-LASSO behave 
slightly worse than SCAD methods. These simulation results lend further support to our 
theoretical conclusions in earlier sections. 

To test the sensitivity of the RCV procedure to the model size s and covariance structure 
among predictors, additional simulations have been conducted and their results are summa- 
rized in Figure E] and [71 From Figure [SI it is clear that RCV method is insensitive to model 
size s, as explained before Theorem 2. Figure [71 shows the RCV methods are also robust 
with respect to the covariance structure. In contrast, N-LASSO always underestimates the 
variance. 



p = 0.5 and b=1 



o 

_8_ o o 


O 00- - 


o @ 

1 ■ 1 O 

1 ° 

1 



Oracle N-SIS N-ISIS N-LASSO RCV-SIS RCV-ISIS RCV-LASSO P-SCAD CV-SCAD P-LASSO CV-LASSO 



Figure 5: Comparison of various methods for variance estimation in model (|18p with n = 200 and 
p = 2000. Presented are boxplots of tr^ based on 100 replications. 

To show the effectiveness of o"rcv in the construction of confidence intervals, we calculate 
the coverage probability of the confidence interval (11 01) based on 10,000 simulations. This 
was conducted for (32 and /^s with 6 = l/-\/3, 1, and 2 and p = and 0.5. To save the 
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p = and b = 1 



p = 0.5 and b= 1 




20 



40 60 
model size 



80 



20 



Oracle 
N-SIS 
N-LASSO 
RCV-SIS 
RCV-LASSO 




40 6 

model size 



— I — 
80 



Figure 6: The sensitivity of model size s on variance estimation. Presented are the medians of 
naive and RCV two-stage estimators when n = 200 and p = 2000 among 100 rephcations. 



b=i b=2 




Figure 7: The impact of covariance structure on variance estimation. Presented are the medians 
of naive and RCV two-stage estimators when n = 200 and p = 2000 among 100 rephcations for 
various p. 
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Table 2: Simulation results for Example 2 with n = 200, p = 2000: The bias (BIAS), Standard 
Error (SE), Average Model Size (AMS) and Sure Screening Probability (SSP) for each procedure 
are reported. 



6 = 2 





p = 


p = 0.5 


Method 


BIAS SE AMS SSP 


BIAS SE AMS SSP 


Oracle 


-0.014 0.089 3.000 1.000 


-0.014 0.090 3.000 1.000 


N-SIS 
N-ISIS 

N-LASSO 


-0.111 0.096 50.000 1.000 
-0.791 0.073 49.130 1.000 

-0.581 0.163 41.460 1.000 


-0.011 0.102 50.000 1.000 
-0.821 0.036 46.870 1.000 

-0.526 0.172 43.310 1.000 


RCV-SIS 
RCV-ISIS 

RCV-LASSO 


-0.030 0.132 50.000 1.000 
-0.017 0.113 25.770 1.000 
-0.004 0.130 34.230 1.000 


0.025 0.279 50.000 0.960 
-0.020 0.106 22.185 1.000 
-0.026 0.147 34.990 1.000 


P-SCAD 
CV-SCAD 
P-LASSO 
CV-LASSO 


-0.048 0.109 7.810 1.000 
0.000 0.095 7.810 1.000 

-0.102 0.195 41.460 1.000 
0.141 0.111 41.460 1.000 


-0.036 0.097 6.080 1.000 
0.001 0.096 6.080 1.000 

-0.113 0.164 43.310 1.000 
0.127 0.116 43.310 1.000 


b= 1/V3 




p = 


p = 0.5 


McIIkkI 


LUAS SE AMS SSI' 


ELVS SE AMS SSP 


Oracle 


-0.014 0.090 3.000 1.000 


-0.014 0.090 3.000 1.000 


N-SIS 
N-ISIS 

N-LASSO 


0.010 0.105 50.000 1.000 
-0.817 0.077 46.400 1.000 
-0.445 0.202 39.290 1.000 


0.046 0.107 50.000 0.980 
-0.809 0.099 46.250 1.000 

-0.381 0.239 37.140 1.000 


RCV-SIS 

RCV-ISIS 

RCV-LASSO 


0.017 0.164 50.000 0.880 
-0.002 0.122 22.225 0.970 
-0.029 0.147 33.470 0.990 


0.057 0.158 50.000 0.430 
0.113 0.161 22.445 0.150 
0.046 0.161 31.890 0.450 


P-SCAD 
CV-SCAD 
P-LASSO 
CV-LASSO 


-0.036 0.096 6.110 1.000 
0.003 0.096 6.110 1.000 

-0.097 0.171 39.290 1.000 
0.126 0.116 39.290 1.000 


-0.066 0.102 14.520 1.000 
0.079 0.124 14.520 1.000 

-0.089 0.171 37.140 1.000 
0.125 0.116 37.140 1.000 
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space of the presentation, we present only one specific case for /3i with 6 = 1 in Table 3. 

Table 3: Simulation results for Example 2 with n = 200, p = 2000, b = 1: coverage probability of 
confidence intervals of difi^erent levels for based on 10000 replications. 





p = 


p = 0.5 




80% 90% 95% 99% 


80% 90% 95% 99% 


Oracle 
RCV 


0.7967 0.8974 0.9476 0.9874 
0.7919 0.8928 0.9435 0.9847 


0.7931 0.9006 0.9483 0.9865 
0.8042 0.9022 0.9518 0.9871 



Example 3. We consider a more realistic model with 10 important predictors, detailed at 
beginning of this section. Since some non- vanishing coefficients are very small, no method can 
guarantee all relevant variables are chosen in the selected model, i.e. possess a sure screening 
property. To quantify the severity of missing relevant variables, we use the quantity Variance 
of Missing Variables (VMV), y&t{x^I3s) /a^ to measure, where S is the set of important 
variables not included in the selected model and /Bg is their regression coefficients in the 
simulated model. For RCV methods, the VMV is the average of VMVs for two halves of the 
data. Figure [8] summarizes the simulation results for {n,p) = (400,1000), whereas Figure 
|9] depicts the results for (n,p) = (400, 10000) when the penalization methods are not easily 
accessible. The naive methods seriously underestimate the variance and sensitive to the 
model selection tools, dimensionality, SNR, among others. In contrast, the RCV methods 
are much more stable and only slightly overestimate the variance when the sure screening 
condition is not satisfied. The one-step methods, especially plug in methods, perform also 
well. 

5.2 Real data analysis 

We now apply our proposed procedure to analyze a recent house price data from 1996-2005. 
The data set consists of 119 months of appreciations of national House Price Index (HPI), de- 
fined as the percent of monthly log-HPI changes in 381 Core Based Statistical Areas (CBSA) 
in the United States. The goal is to forecast the housing price appreciation (HPA) over those 
381 CBSAs over the next several years. Housing prices are geographically dependent. They 
depend also on macroeconomic variables. Their dependence on macroeconomic variables can 
be summarized by the national HPA. Therefore, a reasonable model for predicting the next 
period HPA in a given CBSA is 

381 

Yt = {3o + {3NXt-l,N + J2 l^iXt-i;i + St, (19) 

i=l 
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(a) 



(b) 




Figure 8: (a) The medians of various variance estimators when n = 400 and p = 1000 among 100 
rephcations for Example 3. (b) The medians of variance of missing variables of different model 
selection methods. 




Figure 9: (a) The medians of various variance estimators when n = 400 and p = 10000 among 100 
replications, (b) The medians of variance of missing variables of different model selection tools. 
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where stands for the national HPA, are the HPAs in those 381 CBSAs, and e is 

a random error independent of X. This is clearly a problem with the number of predictors 
more than the number of covariates. However, conditional on the national HPA X^, it is 
reasonable to expect that only the local neighborhoods have non-negligible influence, but 
it is hard to pre-determine those neighborhoods. In other words, it is reasonable to expect 
that the coefficients are sparse. 

Our primary interest is to estimate the residual variance o"^, which is the prediction error 
of the benchmark model. We always keep the variables Xjv and Xi, which is the lag 1 HPA 
of the region to be predicted. We applied the SCAD using the local linear approximation 



(jZou and Lil . l2008l ). which is the iteratively re- weighted LASSO, to estimate coefficients in 
f[T^ . We summarize the result, a, as a function of the selected model size s, to examine 
the sensitivity to the selected model size. Reported also is the percent of variance explained 
which is defined as 

= 1 



where Y is the sample average of the time series. For illustration purpose, we only focus on 
one CBSA in San Francisco and one in Los Angeles. The results are summarized in Table 3 
and Figure [TOl in which the naive two-stage method is also included for comparison. 

First of all, as shown in Figure [TOl the influence of the naive method by the selected model 
size is much larger than that of the RCV method. This is due to the spurious correlation as 
we discussed before. The RCV estimate is reasonably stable, but it is also influenced by the 
selected model size when it is large. This is understandable given the sample size of 119. 

In the case of San Francisco, from Figure [TOT b). the RCV method suggests that the 
standard deviation should be around 0.52%, which is reasonably stable for s in the range 
of 4 to 8. By inspection of the solution path of the naive two-stage method, we see that 
besides Xn and Xi, flrst selected is the variable X306, which corresponds to CBSA San 
Jose- Sunnyvale- Santa Clara (San Benito County, Santa Clara County). The variable X306 
also enters into both models when s > 3 in the RCV method. Therefore, we suggest that 
the selected model consist of at least variables Xi, X2 and X^qq. As expected, in the RCV 
method, the fourth selected variables are not the same for the two splitted subsamples. The 
variance explained by regression takes 79.83% of total variance. 

Similar analysis can be applied to the Los Angeles case. Figure fTUT d) suggests the stan- 
dard deviation should be around 0.50% (when s is between 7 and 10). From the solution 
path, we suggest that the selected model consist of at least variables X^v, Xi and X252 which 
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corresponds to CBSA Oxnard- Thousand Oaks- Ventura (Ventura County). The variance 
explained by regression takes 90.23% of total variance. 

Table 4: Estimated residual standard deviation and variance explained by regression (in percent) for naive 

two-stage and RCV methods for forecasting home price appreciation in San Francisco and Los Angeles. 

San Francisco 



Model size 


2 


3 


5 


10 


15 


20 


30 


Naive 


0.5577 


0.5236 


0.5072 


0.4555 


0.3938 


0.3862 


0.3635 


RCV 


0.5563 


0.5536 


0.5179 


0.5057 


0.4730 


0.4749 


0.4735 


variance explained 


76.92 


79.83 


81.40 


85.67 


89.79 


90.66 


92.58 






L( 


IS A.Llg<-lch 










Model size 


2 


3 


5 


10 


15 


20 


30 


Naive 


0.5236 


0.4887 


0.4583 


0.4401 


0.3747 


0.3137 


0.2503 


RCV 


0.5255 


0.5214 


0.5210 


0.4995 


0.4794 


0.4596 


0.4621 


variance explained 


88.68 


90.23 


91.56 


92.56 


94.86 


96.57 


98.05 



6 Discussion 

Variance estimation is important and challenging for ultrahigh dimensional sparse regression. 
One of the challenges is the spurious correlation: covariates can have high correlations with 
the realized noise and hence are recruited to predict the noise. As a result, the naive (natural) 
two-stage estimator seriously underestimates the variance. Its performance is very unstable 
and depends largely on the model selection tool employed. The RCV method is proposed 
to attenuate the influence of the effect of spurious variables. Both the asymptotic theory 
and empirical result show that the RCV estimator is the best among all estimators. It is 
accurate and stable, insensitive to the model selection tool and the size of the selected model. 
Therefore, we may employ fast model selection tool hke SIS for computational efficiency for 
the RCV variance estimation. We also compare the RCV method with the direct plug-in 
method. When choosing tuning parameters of a penahzed likehhood method like the LASSO, 
we suggest using a more conservative cross-validation rather than aggressive BIC. However, 
the LASSO method can still yield a non-negligible bias for variance estimation in ultrahigh 
dimensional regression. The SCAD method is almost as good as the RCV method, but it is 
computational more expensive than RCV-SIS. 

Appendix 

Notation and conditions 
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(a) SF:naive method 



(b) SF:RCV method 



OOOo 



OOOOoO 



~i 1 1 1 1 r 

5 10 15 20 25 30 



model size, s = 2 to 30 



oo 



oo. 



"~r~ 

10 



15 



~r 



~r 



20 25 
model size, s = 2 to 30 



30 



(o) LAinaive method 
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Figure 10: Estimated standard deviation of benchmark one-step forecast of home price appreciation 
in San Francisco and Los Angeles for various selected model size. The results are based on both 
the naive two-stage and RCV methods. 
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We first state the following assumptions, which are standard in the hteratures of high di- 
mensional statistical learning. For convenience, define 0inin("^) = niinM:|M|<m '^min(^X^XM) 
and 0max(m) = maxM:|M|<m Amax(^X|^XM), whcrc Ainin(A) and Ai„ax(A) denote the smallest 
and largest eigenvalues of a matrix A, respectively. 

For a vector v, we use standard natation ||v||p = and ||v||oo = niaxj{|fj|}. For 

a matrix B, we use three different norms. | |B| [2,00 is defined in Assumption (A8) below; | |B| I2 
denotes the usual operator norm, i.e. ||B||2 = max||i,||2<i ||B'y||2; ||B||oo — maXij{\Bij\} is 
the usual sup-norm. 

(Al) The errors Si, are i.i.d. with zero mean and finite variance cr^ and independent 
of the design matrix X. 

(A2) There exists a constant Aq > and 6„ such that — >■ such that -P|0min(^n) > 
Ao| = 1 for all n. 

(A3) There exists a constant L such that maxjj \Xij\ < L, where Xij is the element 
of the design matrix X. 

(A4) E{exp(|£i|/a)} < b for some finite constants a,b> 0. 

We have no intent to make the assumptions the weakest possible. For example. Assump- 
tion (A3) can be relaxed to maxj., |Xjj| < L(logn)^ for any ,^ > or further relaxation. The 
aim of the assumptions (A3) and (A4) is to guarantee that % in Theorem 1 is of the order 
y/slogp/n. 

Theorem 1 still holds under the random design with assumptions below. 

(A5) The random vectors Xi,...,x„ are i.i.d. and there exists a constant a such that 
E[exp{(|Xy l/p)"}] < L for all i,j and some constants a > 1, and p,L > 0, where X^j 
is the (i, j)th element of X. 

(A6) £1 satisfies that E[exp{(|£i|/a)^}] < b for some finite positive constants a,b,9 > 
and l/a + 1/9 < 1, where a is defined by Assumption (A5). 

For instance, when X^j and are sub-Gaussian {a — 9 — 2) for each i and j, the assumptions 
(A5) and (A6) are satisfied. 
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The following assumption (A7) is imposed for provi ng Theorem 3. For f i xed d esign 
matrix X, the corresponding condition is also imp osed in iMeinshausen and Yul (120091 ) and 
some discussions of weaker conditions are shown in iBickel et al.l ( l2009l ) . 



(A7) There exist constants < /cmin < k^ax < oo such that 

Pjliminf 0min(slogn) > k^^A = 1, 

n— >oo 

and 

P{limSUp0max(s + min{n,p}) < fcmax} = 1- 



The following two additional assumptions a re stated for provi ng Theorem 4. These 
conditions correspond to Conditions 4 and 5 in iFan and Lvl (120091 ) . Without loss of gen- 



erality, assume that the true value (3^ = (/3oi5/3o2) with each component of /3qi nonzero 
and /3g2 = 0. Let Xi and X2 be the submatrices of n x p design matrix X with columns 
corresponding to fB^i and /3q2, respectively. 



(A8) There exist constants < ci,C2 < 00 such that 

p|A^i„(ix[Xi) >ci}^l, 

and 

P{||ix^Xi||2,oo <C2} 

as n — )■ 00, where ||B||2,oo = niax||^||2<i ||Bi7||oo. 
(A9) Denote dn = | minj=i ... |/3oj|- Assume that dn > n^^logn with 7 G (0,1/2]. Take 

l — OQ 

An oc n ~ logn and Xn <^ dn, where ao is defined in Theorem HI 
Remark: The norm ||B||2,oo is somewhat abstract. It can easily be shown that 

||B||2,oo < sIlBlloo, 

where s is the number of columns of B, which is a crude upper bound. Using this and the 
argument in the proof of Theorem 4, if 

pjll^X^XilU <C3} 1 

and A„ > ■n,^(i^3"o)/2 [Qg^ g^^d A„ <^ dn, then the conclusion of Theorem H] holds. 
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A.l. Proof of Theorem [T] 



Part 1 follows the standard law of large numbers and central limit theorem. Now we 
prove the second part under assumptions (Al)- (A4). By Assumption (A2), 

1 

XoTi 

Let Xj denote the j-th column vector of the design matrix X. For a large constant c, consider 
the event £n = {Hiaxi<j<p |Xj£| < cy/n logp} . Under the event 8n, it follows from equation 
(|20|) that 

1 

Ao 

Together with the fact ?7.~^||£|p — )■ a^, we get 

7n = e^Pj^^e/e^e = Op{s\ogp/n). 

Hence it suffices to show that P{£n) — )■ 1 as n — t- oo for some constant c. Observe that, by 
Assumptions (A3)-(A4), for each j, 

ElAije^r < m!(La)'"Eexp{|ei|/a} < ^m\{2ha^L^){aL)'^-^ . 



e Pm^ < —sc^ log p. 



Using Bernstein's inequality (e.g. Lemma 2.2.11 of Ivan der Vaart and Wellneii (119961 ) ), we 
have 



P{^n} < P{ max |X[£| > cy/n\ogp\ 

p 



^ 2 -ex I c^nlogp | 

= 2exp{logp(l- \ )] (21) 

^ 4ba^L^c~^n-^ + 2aLc-^^y\ogp/n^ ^ 

For sufficient large c, we have 46a^L^c~^n~^ + 2aLc~^ ^^/]ogpJn < 1 since logp/n is 
bounded. Therefore, the power in fl2Tl) goes to negative infinity as p — )• oo. It follows that 
P{£4 = 1 - P{£^} ^ 1. 

Next we show the second part of the theorem still holds under Assumptions (A5)-(A6) 
instead of Assumptions (A3)-(A4). It is sufficient to verify P{Sn) — > 1 as n — )■ oo for some 
constant c. The key step is to establish the inequality 

E{\X,^er} < ^m!(8(2 + L + 6)pV)(2pa)'"-2, (22) 
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for each j = 1, ■ ■ ■ ,p. Note that 

P{\XY\ >t}< P{\X\ > t^/"} + P{\Y\ > t^-^/"} 
for a > 1 and random variables X and Y. Thus, for any t > 1 and each i,j, 



Xi. 



P 



>t)- < P 



Xij 



> t 



< Lexp{-t} + bexp{-t^^^-^/'^^} 

< (L + 6)exp(-t). 

If X is a nonnegative random variable with its distribution F{t) and tail probability P{X > 

t} < C exp(— t) for some constant C > and each t > 1, then by integration by parts 



Eexp(-X) = -j^ exp(-)d{l-F(x)} 



1 f°° r 
= 1 + 2/ {l-Fix)}exp{-)dx 

< 2 + C. 



As a result, it follows that, for each z, j. 



Xi 



Eexp<j - 



P 



<2 + {L + b). 



Thus, for each positive integer j and m > 2, 

E{\Xije,r} < (2pa)"m!Eexp<{ ■ 





Xij 








p 




a 



} 



< (2pa)™m!(2 + L + 6) 
= -m\(8{2 + L + b)p^a^y2pa) 



m-2 



Theorem 1 is proved. 



A.2. Proof of Theorem E] 

Define sequences of events Ani = {Mq C Mi}, An2 = {Mq C M2} and An = Ani H An2- 
On the event An-, we have 

(.(2)f(i„/,-p(^)).(2) (,(i))T(i^^^_pa)),(i) 
= = ' 
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where e^^^ and e^^^ correspond to y^-*^^ and y^^\ respectively. Decompose now {n/2 — si){aj — 
cr^) on the event An as 

We now prove (e^^))^?^ ^(2) _ ^^^^^ = Op{^x). 

First, consider the quadratic form 5" = ^'^P^ where P is a symmetric m x m matrix, 
^ — (Ci) ■ ■ ■ ) Cm)^ and {i — 1, - ■ • ,m) are i.i.d. Assume that E^i — 0, E^f = cr^ and the 
fourth moment E^^ < oo. Let be the (i, j)th element of the matrix P. Then, 

m 

E{S) = E{J2CiPii) = • trace{P), 
1=1 

and 



m 

2 



Var(5) = E{J2^i^MkPijPik)-c7'-{J2Pn) 

i,j,l,k i=l 



m 



- E(5^efi^)+E( y: &^p^^Plk)+ E( Yl ^^p^^Pi'^ 

i=l i=ly^j=k i=ky^j=l 

m m 

+E( E ililP^,Plk)-a'■{Y.P^^f 

i=jy^l=k i=l 

m m m m 

= ■ ( E ^) + ■ ( E 4) + ■ ( E p-p^^ - ■ (E p-f 

i=l i^j i^l 1=1 

m m 

i=l ij^j 

< (ECf + a'^) -trace (P2). 
where the last inequality holds since trace (P^) = YlTj Pfj- 

Observe that, trace(p|?') = trace{(P|?* )^} = Si. Hence, on the event Ani, we have 

E{(s«^)fp|s'«lxg;} = s,.^ 

and 



Var{(s(^)fpge(^)|xg;}<(£;e^ + a^)^, 



Using Markov inequality, it follows that, under the event Ani, 
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p p 

Combining with the assumptions Si/n — ^ and P{Ani) — ^ 1, we obtain that 



As a result, 



Similarly, we conclude that 



Therefore, using the last two results, we have 



n — 2si I 2 J n — 2s2 



1=1 



which implies that a/^('3"rcv ~ 

N(0, E[e'^] —cr^)- The Proof of Theorem 2 is completed. 



m 



To prove Theorem 3 , we wi ll use the followin g lemm a. The results are stated and proved 



Meinshausen and Yul (|2009[ ) and 



Bickel et al. 



Lemma 1 Consider the LASSO selector (3^ defined by with A„. Under the assumptions 
(A1)-(A4) and (Al), for A„ oc a^y\ogp/n, there exists a constant M > such that, with 
probability tending to 1 for n — )■ oo, 

and 

\\MPL-M\l<Ma\\ogp. 




A.3. Proof of Theorem S] 

(n — Si){a\ — (T^) can be decomposed as 



= {e^e - na^) - 2 ■ e^X^^ - /3o) + I |X(^^ - /3o) 
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The classical central limit theorem yields Ri = Op{n^^'^). Note that 

|i?2|<2-||X^s|| -ll^^-ZJolli. 



By (|2T|) and Lemma 1, it follows that 

I-R2I = Op{^Jn \ogp) ■ Op{s^/\ogp/n) = Op{s\ogp). 

In addition, by the third conclusion in Lemma 1, |i?3| = Op (s log p). Therefore, the conclu- 
sion holds. 



A. 4. Proof of Theorem [4] 

Let ^ = , 0"^)'^ with = (X"fXi)~^X"fy be the oracle estimator. The key step is to 

^ o 

show that, with probability tending to 1, the oracle estimat or /3 is a str i ctly lo cal minimizer 



of Qn\„{f^) defined by flTBl) . To prove it, by Theorem 1 of iFan and Lvl (120091 ) . it suffices to 
show that, with probability tending to 1, ;9 satisfies 

Xf(y-X^°)-np,J^i) = 0, (23) 
||Xj(y-X^°)|U< Va„(0+), (24) 

A,,in(-XfXi) >t^xM), (25) 
n 

where Pa„(A) = (sgn(/3i)pA„(|/3i|), ■ ■ ■ , sgn(/3,)p';^J|/3,|))^ and ka„(^i) = maxj=i,...,, { - 

Let $,1 = X^e and ^2 = ^2^- Consider the events 

•^ni = jll^illoo < Vnlognlog lognj fl |Amin(^XfXi) > cij and 

An2 = {||^2lloo < Vn"0+lloglogn} n |||^X^Xi||2,oo < C2}. 

Observe that = (XfXi)-iXfy. Then, we get ^-^ - [3^^ = (Xf Xi)-iXf e and hence, 
under the event Ani, 

||^i-/3oilloo < ||^i-/3oill2 

< \\i-X.JX,)-%\\-X.^s\\2 

n n 

< [A^in(-XrXi)]"'-v^-||-Xfs|U 

< CA/logn log logn/n^""" ^ A^, 
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for some constant c not depending on n. Note that, in the above inequahties, we use that 
facts s = 0{n'^°) and A„ oc n ~ \ogn. 

Since (i„ = iminj=i ... <, |/3oj| > n~'^\ogn with 7 G (0, 1/2] and dn ^ A„, as addressed in 
Assumption (A9), we have, under the event Ani, 

min \i3j\ > mm |/3oj| - ||^i - /^oilloo 

> 2 ■ dn — c^y \ogn\og log n/n^~"o 

> rf„ > A„ 

for sufficiently large n. As a result, this leads to P\^0i) = and kx„0i) = 0, and hence 
imply that (!23|) and (125|1 hold under the event Ani- 

Now turn to prove the inequality (|2^ . Under the event fl ^„2 5 we have 

"^-Xny-X^°)|U < -||^2lloo + -||X^Xi||2,oo||^i-/3oil|2 (26) 



n n n 

< a/ ?2"o-i log logn + C2Cv^log n log log n/n^""" 

oc A„(A/log logn/ logn + C2C-\/log logn/ logn) 

< iA„<p',„(0+) 

for sufficiently large n. This shows that the inequality ( 12^ holds for sufficiently large n 
under the event Ani H v4„2. By taking c = i/log log n, similar arguments to Theorem 1 lead 

to 

P{Anl n An2} 1 

as n — 7- 00. Thus, we have proven that ^ is a strictly local minimizer of Q„ a„(/3) with large 

^ o 

probability tending to one. Consequently, /3scad = /3 • 
Now consider the asymptotic distribution of 0"gcAD~'^^- 

Observe that /3i = (Xf X^^^Xf y. 

Under the event Ani H An2i 

^SCAD - = —^e^iln - Pa/o)^ - O-^- 

n — s 

Hence, we have that 

which also implies that (Tgc^D — o"^ = Op(n~^/^). The proof is complete. 



A. 5. Proof of (SI) and ([5]) 
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Let $(■) and F„_2(-) be the c.d.f. of standard Gaussian and student's t distribution with 
n — 2 degrees of freedom. For large u, 

1 - K-2iu) > 1 - ^(u) > exp(-M^). 

Therefore, u = A/log(p/c) satisfies Fn^2{u) < < 1 — c/p. The classical result that 

{^ni}j=i are i.i.d. t„_2 distribution entails that 

P\ sup ^nj>u\ = P\ sup F„_2(^nj) > ^n-2(M)} 
l<j<P ^ ^ ^<j<P 

= l-(l-F„_2Hr, 
which, by the choice of u, is further bounded from below by 

1 - (I-c/pY > 1 -exp(-c). 

Note that = ^nj/ (n — 2 + ^njY^'^ is strictly increasing. It follows that 

P \ sup -fnj > ^ \ = P{ sup Cnj > u] > 1 - exp(-c). 

[i<j<p [n-2 + u^YI^ } ^i<j<p J 

The result (jl]) follows from the fact that when < n + 2, 

u u 

< 



(n-2 + «2)V2 

We now derive the limiting distribution (jS]). For each x > 

P\ \/2\ogp( sup - dp) < x\ = P\ sup ^nj < dp + 

^ i<i<p J ^ i<i<p 



X 



<j<p J i<j<p V^iogP 

/•oo 

1 - / fn-2{t)dt 



Therefore, it suffices to show 

POO 

P / fn-2{t)dt ^ exp{-a;}. (27) 



Let 1/ = n — 2. The following inequalities are helpful to verify the limit ( 1271) 
where C{u) = —, — Substituting t = dp + —?= — into the inequalities fl28|) . it is easy to 



verify that under condition logp = 0(722' 



/>oo 

exp{— x} + 0(1) < p fi^it)dt < exp{— x} + o(l). 

Jdp+ ,„f 



This proves (1271) and hence ([5]). 
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